NSF PAR Search | NSF Public Access Repository

A Universally Optimal Multistage Accelerated Stochastic Gradient Method

Aybat, Necdet Serhat; Fallah, Alireza; Gurbuzbalaban, Mert; Ozdaglar, Asuman (December 2019, Advances in neural information processing systems)
Wallach, H.; Larochelle, H.; Beygelzimer, A.; Fox, E.; Garnett, R. (Ed.)
Full Text Available
Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration

Jun, Kwang-Sung; Cutkosky, Ashok; Orabona, Francesco (January 2019, Advances in neural information processing systems)
Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R. (Ed.)
In this paper, we consider nonparametric least-squares regression in a Reproducing Kernel Hilbert Space (RKHS). We propose a new randomized algorithm that has optimal generalization error bounds with respect to the square loss, closing a long-standing gap between upper and lower bounds. Moreover, we show that our algorithm has faster finite-time and asymptotic rates on problems where the Bayes risk with respect to the square loss is small. We state our results using standard tools from the theory of least-squares regression in RKHSs, namely, the decay of the eigenvalues of the associated integral operator and the complexity of the optimal predictor measured through the integral operator.
Full Text Available
Momentum-Based Variance Reduction in Non-Convex SGD

Cutkosky, Ashok; Orabona, Francesco (January 2019, Advances in neural information processing systems)
Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R. (Ed.)
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $$F$$, STORM finds a point $$\boldsymbol{x}$$ with $$E[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$$ in $$T$$ iterations with $$\sigma^2$$ variance in the gradients, matching the optimal rate and without requiring knowledge of $$\sigma$$.
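The abstract above does not spell out the update rule, so the following is only a minimal sketch of a momentum-style variance-reduced step with an adaptive learning rate, in the spirit of what the abstract describes. Everything concrete here is an assumption for illustration: the toy objective F(x) = 0.5 * ||x||^2, the additive Gaussian noise model, the function names (noisy_grad, storm_sketch, norm), the constants k, w and c, and the clipping of the momentum weight to at most 1. The key idea shown is that the same noise sample is used to evaluate the gradient at both the new and the previous iterate, so the correction term reduces variance without any large batch.

```python
import math
import random


def noisy_grad(x, noise):
    """Stochastic gradient of the toy loss F(x) = 0.5 * ||x||^2 (assumed objective)."""
    return [xi + ni for xi, ni in zip(x, noise)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def storm_sketch(x0, steps=2000, k=0.1, w=0.1, c=100.0, sigma=0.1, seed=0):
    """Illustrative momentum-based variance-reduced SGD; constants are assumptions."""
    rng = random.Random(seed)
    dim = len(x0)

    def sample():
        # One noise draw plays the role of one stochastic sample.
        return [rng.gauss(0.0, sigma) for _ in range(dim)]

    x = list(x0)
    d = noisy_grad(x, sample())             # first estimate: a plain stochastic gradient
    grad_sq_sum = sum(g * g for g in d)     # running sum of squared gradient norms

    for _ in range(steps):
        # Adaptive step size: shrinks as observed gradient energy accumulates.
        eta = k / (w + grad_sq_sum) ** (1.0 / 3.0)
        x_next = [xi - eta * di for xi, di in zip(x, d)]

        # Momentum weight tied to the step size; clipped to 1 as a safety guard (assumption).
        a = min(1.0, c * eta * eta)

        # Evaluate the SAME sample at the new and the old iterate.
        noise = sample()
        g_next = noisy_grad(x_next, noise)
        g_prev = noisy_grad(x, noise)

        # Recursive estimate: fresh gradient plus a corrected copy of the old estimate.
        d = [gn + (1.0 - a) * (di - gp) for gn, di, gp in zip(g_next, d, g_prev)]

        grad_sq_sum += sum(g * g for g in g_next)
        x = x_next
    return x


if __name__ == "__main__":
    x_final = storm_sketch([5.0, -3.0])
    print("final gradient norm on the toy problem:", norm(x_final))
```

On this toy problem the true gradient at x is x itself, so the norm of the returned point doubles as the gradient norm. No step in the loop uses a batch larger than one sample, and sigma is used only to generate noise, never by the update itself.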
Full Text Available
Unsupervised Meta-Learning for Few-Shot Image Classification

Khodadadeh, Siavash; Boloni, Ladislau; Shah, Mubarak (January 2019, Advances in neural information processing systems)
Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R. (Ed.)
Full Text Available